Decision Trees (DT)

Geometric Intuition of Decision Trees: Axis-Parallel Hyperplanes

A Decision Tree is a flowchart-like structure that recursively splits data based on feature values. Each split creates decision boundaries that are parallel to the feature axes, which makes decision trees highly interpretable. Below, we explore how a decision tree models the given dataset.


Dataset Used

Outlook Temp. Humidity Windy Play Golf
Rainy Hot High False No
Rainy Hot High True No
Overcast Hot High False Yes
Sunny Mild High False Yes
Sunny Cool Normal False Yes
Sunny Cool Normal True No
Overcast Cool Normal True Yes
Rainy Mild High False No
Rainy Cool Normal False Yes
Sunny Mild Normal False Yes
Rainy Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Sunny Mild High True No

This dataset contains four predictor variables (Outlook, Temperature, Humidity, and Windy) and a target variable (Play Golf), which the decision tree uses to predict whether golf can be played.


How the Decision Tree Works on This Dataset

  1. Step 1: Splitting on Outlook (First Decision Boundary)
    • The tree first checks Outlook because it is the most informative feature.
    • This results in three distinct groups in the feature space:
      • If Outlook = Overcast → Always "Yes" (no further splitting needed).
      • If Outlook = Sunny → Further splits occur based on Windy.
      • If Outlook = Rainy → Further splits occur based on Humidity.
  2. Step 2: Further Splitting on Other Features
    • For Sunny Days:
      • If Windy = False → "Yes" (can play golf).
      • If Windy = True → "No" (not suitable).
    • For Rainy Days:
      • If Humidity = High → "No" (too humid to play).
      • If Humidity = Normal → "Yes" (good conditions).
  3. Final Decision Boundaries
    • The dataset is now divided into smaller rectangular decision regions, each corresponding to a final classification (Yes/No).
    • Each leaf node in the decision tree represents a final classification for a set of feature values.

Geometric Intuition: How the Tree Creates Decision Boundaries

  • The decision tree divides the feature space into axis-parallel regions.
  • Each split corresponds to a vertical or horizontal boundary in the space of selected features.
  • The resulting decision regions are rectangular and aligned with feature axes.
  • This structure differs from models such as logistic regression, whose linear decision boundaries need not be axis-aligned and can be oblique (diagonal).

Example: Decision Boundaries for Outlook & Humidity

If we visualize a 2D feature space with Outlook on the x-axis and Humidity on the y-axis:

  • The first split at Outlook = Overcast creates a vertical boundary separating "Yes" from the rest.
  • The second split at Humidity = High for Rainy days creates a horizontal boundary in the Rainy region.
  • Further splits (like Windy for Sunny days) refine the classification into more rectangular segments.

Key Takeaways

  • Decision Trees create axis-aligned decision boundaries (rectangular regions).
  • Each split represents an IF-ELSE condition, forming nested rules.
  • Deeper trees create more refined partitions, improving accuracy but risking overfitting.

This makes Decision Trees intuitive and interpretable, but they may struggle with complex, non-rectangular patterns in the data.
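To see the axis-parallel nature of these splits concretely, here is a minimal sketch, assuming scikit-learn and NumPy are installed; it uses a small synthetic two-feature numeric dataset (not the golf data) and prints the fitted tree's rules, each of which is a single-feature threshold, i.e., a vertical or horizontal boundary.

```python
# Minimal sketch: a decision tree on two numeric features only ever splits on
# one feature at a time, producing axis-parallel (rectangular) regions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))              # two numeric features x0, x1
y = ((X[:, 0] > 5) & (X[:, 1] > 3)).astype(int)    # target is a rectangular region

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Every printed rule has the form "feature <= threshold": a vertical or
# horizontal boundary in the (x0, x1) plane.
print(export_text(tree, feature_names=["x0", "x1"]))
```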

Explain Entropy?

Entropy

  • Entropy is a measure of unpredictability (impurity). A random variable with higher entropy is more unpredictable than one with lower entropy.
  • Suppose we toss a coin 4 times and observe the following outcomes:
Outcome P(H) P(T) Entropy Interpretation
{H, H, T, T} 50% 50% 1 The outcome is maximally random (fair coin)
{T, T, T, H} 25% 75% 0.8113 The outcome is less random, i.e., a biased coin
  • For a random variable Y that can take K values, the entropy of Y is:

\[ H(Y) = -\sum_{i=1}^{K} P(y_i) \log_b P(y_i), \quad \text{where } P(y_i) = P(Y = y_i) \]

Typically, b = 2 (log₂, in bits) or b = e (natural log, in nats).

  • For example, the Play Golf dataset above has two classes (Yes and No):

\[ H(y) = -P(y = \text{yes}) \log_2 P(y = \text{yes}) - P(y = \text{no}) \log_2 P(y = \text{no}) \]

\[ H(y) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.94 \]
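As a quick check of these values, here is a minimal sketch of the entropy computation in plain Python (base-2 logarithm; the helper name `entropy` is just illustrative).

```python
# Minimal sketch of the base-2 entropy calculation used above.
from math import log2

def entropy(counts):
    """H(Y) = -sum_i p_i * log2(p_i), computed from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([2, 2]))   # fair coin {H, H, T, T}   -> 1.0
print(entropy([1, 3]))   # biased coin {T, T, T, H} -> 0.8113
print(entropy([9, 5]))   # Play Golf (9 Yes, 5 No)  -> 0.9403
```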

Comparing entropies for different cases for 2-class example

Entropy graph for 2-class example
Sr. No. P(Y+) P(Y-) Entropy
1 99% 1% 0.08
2 50% 50% 1.00
3 100% 0% 0.00

If both classes are equally probable, we have a maximum entropy value of 1.

If one class fully dominates, the entropy value becomes 0.


Comparing entropies for different cases for Multi-class example

  • If all classes are equiprobable (a uniform distribution), the entropy is maximum.
Uniform distribution entropy graph
  • If one class has 100% probability and others are zero, the entropy is minimum.
  • In this figure, a more peaked (concentrated) distribution has lower entropy than a flatter, more spread-out one: the more peaked the distribution, the lower the entropy.

Kullback-Leibler Divergence or relative entropy

KL Divergence Graph
  • KL Divergence measures the difference between two probability distributions over the same variable X.
  • For discrete probability distributions P and Q defined on the same probability space X:

\( D_{KL}(P || Q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \)

  • For distributions P and Q of a continuous random variable

\( D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \,dx \)

where p(x) and q(x) are the probability mass functions (discrete case) or probability densities (continuous case) of P and Q.
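For the discrete case, a minimal sketch in plain Python (base-2 logarithm; the two distributions below are made-up examples and the helper name `kl_divergence` is illustrative):

```python
# Minimal sketch of KL divergence between two discrete distributions P and Q
# defined over the same outcomes.
from math import log2

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log2(p(x) / q(x)).
    Assumes both inputs sum to 1 and q(x) > 0 wherever p(x) > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # hypothetical distribution P
q = [0.4, 0.4, 0.2]   # hypothetical distribution Q
print(kl_divergence(p, q))   # > 0; KL divergence is 0 only when P == Q
print(kl_divergence(p, p))   # 0.0
```

Note that KL divergence is not symmetric: D_KL(P || Q) is generally different from D_KL(Q || P).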

Explain Information Gain

Information Gain

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Example:

  • As computed earlier, the entropy of the Play Golf target variable is 0.94:

\[ H(y) = -P(y = \text{yes}) \log P(y = \text{yes}) - P(y = \text{no}) \log P(y = \text{no}) \]

\[ H(y) = - \left( \frac{9}{14} \log \frac{9}{14} + \frac{5}{14} \log \frac{5}{14} \right) = 0.94 \]

The dataset is then split on each attribute in turn. The entropy of each branch is calculated and weighted by the fraction of samples in that branch to obtain the total (weighted) entropy of the split. Subtracting this from the entropy before the split gives the Information Gain, i.e., the decrease in entropy.

Outlook vs Play Golf

Outlook Yes No Entropy W.E(Play, Outlook)
Sunny 3 2 0.9710 0.3468
Overcast 4 0 0.0000 0.0000
Rainy 2 3 0.9710 0.3468
Gain 0.247

Humidity vs Play Golf

Humidity Yes No Entropy W.E(Play, Humidity)
High 3 4 0.9852 0.4926
Normal 6 1 0.5917 0.2958
Gain 0.152

Temperature vs Play Golf

Temp Yes No Entropy W.E(Play, Temp)
Hot 2 2 1.0000 0.2857
Mild 4 2 0.9183 0.3936
Cool 3 1 0.8113 0.2318
Gain 0.029

Windy vs Play Golf

Windy Yes No Entropy W.E(Play, Windy)
FALSE 6 2 0.8113 0.4636
TRUE 3 3 1.0000 0.4286
Gain 0.048

\[ \text{Gain}(T, X) = \text{Entropy}(T) - W.\text{Entropy}(T, X) \]

\[ G(\text{Play Golf}, \text{Outlook}) = \text{Entropy}(\text{Play Golf}) - W.\text{Entropy}(\text{Play Golf}, \text{Outlook}) \]

\[ 0.94 - 0.693 = 0.247 \]

\[ IG(y, D) = H_D(y) - \sum_{i=1}^{n} \frac{|D_i|}{|D|} H_{D_i}(y) \]
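The gains in the tables above can be reproduced directly from the (Yes, No) counts per attribute value. Here is a minimal sketch in plain Python (helper names are illustrative; the counts are taken from the tables above).

```python
# Minimal sketch: reproduce the information-gain values from the class counts.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, splits):
    """IG = H(parent) - sum over children of (|D_i| / |D|) * H(D_i)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in splits.values())
    return entropy(parent_counts) - weighted

parent = [9, 5]  # 9 Yes, 5 No
splits = {
    "Outlook":  {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]},
    "Humidity": {"High": [3, 4], "Normal": [6, 1]},
    "Temp":     {"Hot": [2, 2], "Mild": [4, 2], "Cool": [3, 1]},
    "Windy":    {"False": [6, 2], "True": [3, 3]},
}
for attr, split in splits.items():
    print(attr, round(information_gain(parent, split), 3))
# Outlook 0.247, Humidity 0.152, Temp 0.029, Windy 0.048
```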

Build a Decision Tree using Information Gain with an Example

Step 1: Compute Entropy for the Target Variable (Play Golf?)

Entropy formula:

H(S) = - ∑ p(i) log₂ (p(i))

From the dataset:

  • Total samples = 14
  • Yes (Play Golf) = 9
  • No (Play Golf) = 5

H(S) = - (9/14 log₂ (9/14) + 5/14 log₂ (5/14))

H(S) = - (0.643 × (-0.637) + 0.357 × (-1.485)) = 0.940

Step 2: Compute Information Gain for Each Attribute

(a) Information Gain for "Outlook"

Splitting based on Outlook:

  • Sunny: (3 Yes, 2 No) → Entropy = 0.970
  • Overcast: (4 Yes, 0 No) → Entropy = 0.000 (pure)
  • Rainy: (2 Yes, 3 No) → Entropy = 0.970

H(S | Outlook) = (5/14 × 0.970) + (4/14 × 0.000) + (5/14 × 0.970) = 0.693

IG(Outlook) = H(S) - H(S | Outlook) = 0.940 - 0.693 = 0.247

(b) Information Gain for "Humidity"

  • High: (3 Yes, 4 No) → Entropy = 0.985
  • Normal: (6 Yes, 1 No) → Entropy = 0.592

H(S | Humidity) = (7/14 × 0.985) + (7/14 × 0.592) = 0.788

IG(Humidity) = H(S) - H(S | Humidity) = 0.940 - 0.788 = 0.152

(c) Information Gain for "Windy"

  • False: (6 Yes, 2 No) → Entropy = 0.811
  • True: (3 Yes, 3 No) → Entropy = 1.000

H(S | Windy) = (8/14 × 0.811) + (6/14 × 1.000) = 0.892

IG(Windy) = H(S) - H(S | Windy) = 0.940 - 0.892 = 0.048

Step 3: Choose the Best Attribute

  • Outlook: 0.247 (Highest Information Gain)
  • Humidity: 0.152
  • Windy: 0.048

Since Outlook has the highest IG, it becomes the root node.

Step 4: Split and Recursively Repeat

Now, we divide the dataset based on Outlook:

(a) Subtree for "Overcast"

All are Yes, so it's a leaf node.

(b) Subtree for "Rainy"

Choices:

  • Humidity: 0.971
  • Windy: 0.020

Since Humidity has the highest IG, we split on Humidity.

(c) Subtree for "Sunny"

Choices:

  • Windy: 0.971
  • Humidity: 0.020

Since Windy has the highest IG, we split on Windy (see the quick check below).
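A quick numeric check of the subtree gains quoted above, computed from the class counts inside the Rainy (2 Yes, 3 No) and Sunny (3 Yes, 2 No) subsets (plain Python, base-2 logarithm; helper names are illustrative):

```python
# Minimal sketch: information gain within the Rainy and Sunny subsets.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# Rainy subset (2 Yes, 3 No)
print(gain([2, 3], [[0, 3], [2, 0]]))   # Humidity: High=(0Y,3N), Normal=(2Y,0N) -> 0.971
print(gain([2, 3], [[1, 2], [1, 1]]))   # Windy:    False=(1Y,2N), True=(1Y,1N)  -> 0.020
# Sunny subset (3 Yes, 2 No)
print(gain([3, 2], [[3, 0], [0, 2]]))   # Windy:    False=(3Y,0N), True=(0Y,2N)  -> 0.971
print(gain([3, 2], [[1, 1], [2, 1]]))   # Humidity: High=(1Y,1N), Normal=(2Y,1N) -> 0.020
```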

Summary

  • Calculated Entropy for the target variable.
  • Computed Information Gain for each feature.
  • Chose the best attribute (Outlook) as the root node.
  • Repeated the process recursively for sub-nodes.

R₁: IF (Outlook=Sunny) AND (Windy=FALSE) THEN Play=Yes

R₂: IF (Outlook=Sunny) AND (Windy=TRUE) THEN Play=No

R₃: IF (Outlook=Overcast) THEN Play=Yes

R₄: IF (Outlook=Rainy) AND (Humidity=High) THEN Play=No

R₅: IF (Outlook=Rainy) AND (Humidity=Normal) THEN Play=Yes

                         Outlook
                        /   |    \
                  Sunny  Overcast  Rainy
                    |        |        |
                  Windy   Play=Yes  Humidity
                 /     \            /      \
              False    True      High     Normal
            Play=Yes  Play=No   Play=No  Play=Yes
                                    
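The whole construction can also be reproduced programmatically. Below is a minimal ID3-style sketch in plain Python (standard library only; function and variable names are illustrative) that recursively picks the attribute with the highest information gain and prints a nested-dictionary version of the tree above.

```python
# Minimal ID3-style sketch: recursively split on the attribute with the
# highest information gain, exactly as in Steps 1-4 above.
from math import log2
from collections import Counter

# Play Golf rows from the dataset above: (Outlook, Temp, Humidity, Windy, Play)
data = [
    ("Rainy", "Hot", "High", "False", "No"),
    ("Rainy", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Sunny", "Mild", "High", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Rainy", "Mild", "High", "False", "No"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "High", "True", "No"),
]
FEATURES = ["Outlook", "Temp", "Humidity", "Windy"]

def entropy(rows):
    """Entropy of the class label (last column) over a list of rows."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, feat):
    """Information gain of splitting `rows` on feature `feat`."""
    i = FEATURES.index(feat)
    groups = {}
    for r in rows:
        groups.setdefault(r[i], []).append(r)
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - weighted

def build(rows, feats):
    """Recursively build the tree as nested dicts."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:                       # pure node -> leaf
        return labels.pop()
    if not feats:                              # no features left -> majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(feats, key=lambda f: info_gain(rows, f))
    i = FEATURES.index(best)
    children = {}
    for value in sorted({r[i] for r in rows}):
        subset = [r for r in rows if r[i] == value]
        children[value] = build(subset, [f for f in feats if f != best])
    return {best: children}

print(build(data, FEATURES))
# -> {'Outlook': {'Overcast': 'Yes',
#                 'Rainy': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                 'Sunny': {'Windy': {'False': 'Yes', 'True': 'No'}}}}
#    (output reformatted for readability; it matches rules R1-R5)
```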

Explain Gini Impurity

Gini Impurity

Gini Impurity is a metric that measures how often a randomly chosen element from a dataset would be incorrectly classified if it were labeled randomly according to the class distribution in that dataset.

It is used in decision trees, especially in algorithms like **CART (Classification and Regression Trees)**, to decide the best split at each node.

Formula:

\[ Gini(D) = 1 - \sum_{i=1}^{c} P_i^2 \]

Where:

  • \(P_i\) is the probability of class \(i\) in the dataset.
  • \(c\) is the total number of classes.

Example:

  • In the Play Golf dataset, we have **Yes (9)** and **No (5)** instances.

\[ Gini = 1 - \left( \left(\frac{9}{14}\right)^2 + \left(\frac{5}{14}\right)^2 \right) \]

\[ Gini = 1 - \left(0.413 + 0.128\right) = 0.459 \]

The lower the Gini Impurity, the purer the dataset.

Outlook vs Play Golf

Outlook Yes No Gini W.G(Play, Outlook)
Sunny 3 2 0.48 0.171
Overcast 4 0 0.00 0.000
Rainy 2 3 0.48 0.171
Gini Gain 0.116

Humidity vs Play Golf

Humidity Yes No Gini W.G(Play, Humidity)
High 3 4 0.49 0.245
Normal 6 1 0.24 0.122
Gini Gain 0.092

Temperature vs Play Golf

Temp Yes No Gini W.G(Play, Temp)
Hot 2 2 0.50 0.143
Mild 4 2 0.44 0.190
Cool 3 1 0.38 0.107
Gini Gain 0.019

Windy vs Play Golf

Windy Yes No Gini W.G(Play, Windy)
FALSE 6 2 0.38 0.214
TRUE 3 3 0.50 0.214
Gini Gain 0.031

\[ Gini\ Gain(T, X) = Gini(T) - W.Gini(T, X) \]

\[ Gini(\text{Play Golf}, \text{Outlook}) = Gini(\text{Play Golf}) - W.Gini(\text{Play Golf}, \text{Outlook}) \]

\[ 0.459 - 0.343 = 0.116 \]
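These Gini values can be reproduced from the per-value class counts. Here is a minimal sketch in plain Python (helper names are illustrative; the counts come from the tables above).

```python
# Minimal sketch: reproduce the Gini impurity and Gini-gain values above.
def gini(counts):
    """Gini(D) = 1 - sum_i p_i^2, computed from a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, splits):
    """Gini gain = Gini(parent) - weighted Gini of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * gini(c) for c in splits.values())
    return gini(parent_counts) - weighted

parent = [9, 5]  # 9 Yes, 5 No
splits = {
    "Outlook":  {"Sunny": [3, 2], "Overcast": [4, 0], "Rainy": [2, 3]},
    "Humidity": {"High": [3, 4], "Normal": [6, 1]},
    "Temp":     {"Hot": [2, 2], "Mild": [4, 2], "Cool": [3, 1]},
    "Windy":    {"False": [6, 2], "True": [3, 3]},
}
print(round(gini(parent), 3))                      # 0.459
for attr, split in splits.items():
    print(attr, round(gini_gain(parent, split), 3))
# Outlook 0.116, Humidity 0.092, Temp 0.019, Windy 0.031
```

As with information gain, Outlook gives the largest Gini gain, so CART would also pick it as the root split.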

Summary

  • Gini Impurity measures how mixed the classes are in a dataset.
  • The lower the Gini value, the purer the dataset.
  • Decision trees using CART split on the attribute with the **lowest** weighted Gini Impurity (equivalently, the highest Gini gain).

Overfitting vs. Underfitting in Decision Trees

Overfitting

  • Definition: The model learns the training data too well, capturing noise and irrelevant patterns.
  • Causes: Very deep trees, lack of pruning, too many features.
  • Effects: High training accuracy but poor test accuracy.
  • Solutions: Pruning, limiting tree depth, using ensemble methods like Random Forest.

Underfitting

  • Definition: The model is too simple and fails to capture patterns in the data.
  • Causes: Shallow trees, too much pruning, not using enough features.
  • Effects: Poor performance on both training and test data.
  • Solutions: Increase tree depth, use more features, reduce pruning.

Comparison Table

Criteria Overfitting Underfitting
Model Complexity Too complex Too simple
Training Accuracy High Low
Test Accuracy Low (poor generalization) Low (poor learning)
Solution Pruning, limiting depth Increase depth, reduce pruning
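As a rough illustration of this trade-off, here is a minimal sketch, assuming scikit-learn is installed; it compares an unrestricted tree against depth-limited and cost-complexity-pruned trees on a synthetic dataset, so the exact scores are not meaningful, only the train/test pattern.

```python
# Minimal sketch: controlling over/underfitting with depth limits and pruning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = [
    ("deep, unrestricted (overfit-prone)", DecisionTreeClassifier(random_state=0)),
    ("max_depth=1 (underfit-prone)",       DecisionTreeClassifier(max_depth=1, random_state=0)),
    ("max_depth=4 (depth-limited)",        DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("ccp_alpha=0.01 (pruned)",            DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)),
]
for name, model in candidates:
    model.fit(X_tr, y_tr)
    print(f"{name}: train={model.score(X_tr, y_tr):.2f}, test={model.score(X_te, y_te):.2f}")
```

A large gap between training and test accuracy signals overfitting; low accuracy on both signals underfitting.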